Multi-lingual Common Semantic Space Construction via Cluster-consistent Word Embedding
We construct a multilingual common semantic space based on distributional
semantics, where words from multiple languages are projected into a shared
space to enable knowledge and resource transfer across languages. Beyond word
alignment, we introduce multiple cluster-level alignments and require word
clusters to be consistently distributed across multiple languages. We exploit
three signals for clustering: (1) neighbor words in the monolingual word
embedding space; (2) character-level information; and (3) linguistic properties
(e.g., apposition, locative suffix) derived from linguistic structure knowledge
bases available for thousands of languages. We introduce a new
cluster-consistent correlational neural network to construct the common
semantic space by aligning words as well as clusters. Intrinsic evaluation on
monolingual and multilingual QVEC tasks shows our approach achieves
significantly higher correlation with linguistic features than state-of-the-art
multi-lingual embedding learning methods do. Using low-resource language name
tagging as a case study for extrinsic evaluation, our approach achieves up to
24.5% absolute F-score gain over the state of the art.
Comment: 10 pages
Cold-start universal information extraction
Who? What? When? Where? Why? These are fundamental questions asked when gathering knowledge about and understanding a concept, topic, or event. The answers to these questions underpin the key information conveyed in the overwhelming majority, if not all, of language-based communication. At the core of my research in Information Extraction (IE) is the desire to endow machines with the ability to automatically extract, assess, and understand text in order to answer these fundamental questions. IE is one of the most important components of many downstream natural language processing (NLP) tasks, such as knowledge base completion, machine reading comprehension, and machine translation. The proliferation of the Web also intensifies the need to deal with enormous amounts of unstructured data from varied sources, spanning many languages, genres, and domains.
When building an IE system, the conventional pipeline is to (1) ask expert linguists to rigorously define a target set of knowledge types we wish to extract by examining a large data set, (2) collect resources and human annotations for each type, and (3) design features and train machine learning models to extract knowledge elements. In practice, this process is very expensive because each step involves extensive human effort, which is not always available. For example, to specify the knowledge types for a particular scenario, both consumers and expert linguists need to examine a large amount of data from that domain and write detailed annotation guidelines for each type. Hand-crafted schemas, which define the types and complex templates of the expected knowledge elements, often provide low coverage and fail to generalize to new domains. For example, none of the traditional event extraction programs, such as ACE (Automatic Content Extraction) and TAC-KBP, include "donation" and "evacuation" in their schemas, in spite of their potential relevance to natural disaster management users. Additionally, these approaches are highly dependent on linguistic resources and human-labeled data tuned to pre-defined types, so they suffer from poor scalability and portability when moving to a new language, domain, or genre.
The focus of this thesis is to develop effective theories and algorithms for IE that not only yield satisfactory quality by incorporating prior linguistic and semantic knowledge, but also achieve greater portability and scalability by moving away from the high cost and narrow focus of large-scale manual annotation. This thesis opens up a new research direction called Cold-Start Universal Information Extraction, where the full extraction and analysis starts from scratch and requires little or no prior manual annotation or pre-defined type schema. In addition to this new research paradigm, we also contribute effective algorithms and models towards resolving the following three challenges:
How can machines extract knowledge without any pre-defined types or any human-annotated data? We develop an effective bottom-up and unsupervised Liberal Information Extraction framework based on the hypothesis that the meaning and underlying knowledge conveyed by linguistic expressions is usually embodied by their usage in language. This makes it possible to automatically induce a type schema from rich contextual representations of all knowledge elements, combining their symbolic and distributional semantics using unsupervised hierarchical clustering.
How can machines benefit from available resources, e.g., large-scale ontologies or existing human annotations? My research has shown that pre-defined types can also be encoded by rich contextual or structured representations, through which knowledge elements can be mapped to their appropriate types. Therefore, we design a weakly supervised Zero-shot Learning approach and a Semi-Supervised Vector Quantized Variational Auto-Encoder approach that frame IE as a grounding problem instead of classification: knowledge elements are grounded in types that are either drawn from an extensible, large-scale target ontology or induced from the corpora, with annotations available for only a few types.
How can IE approaches be extended to low-resource languages without any extra human effort? There are more than 6000 living languages in the world, while public gold-standard annotations are only available for a few dominant languages. To facilitate the adaptation of these IE frameworks to other languages, especially low-resource languages, a Multilingual Common Semantic Space is further proposed to serve as a bridge for transferring existing resources and annotated data from dominant languages to more than 300 low-resource languages. Moreover, a Multi-Level Adversarial Transfer framework is also designed to learn language-agnostic features across various languages.
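The idea of treating IE as grounding rather than classification, raised in the second challenge above, can be illustrated with a minimal sketch. It is not the thesis's model: it assumes mentions and type labels already live in one shared vector space, and it uses plain cosine similarity as the matching score. The function names (`cosine`, `ground`) and all vectors are hypothetical, made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def ground(mention_vec, type_vecs):
    """Return the (type_name, score) pair whose type vector is closest to the mention.

    type_vecs maps a type name to its vector. Because the type inventory is
    just a dictionary, new types can be added without retraining anything,
    which is the property that separates grounding from a fixed-label classifier.
    """
    scored = ((name, cosine(mention_vec, vec)) for name, vec in type_vecs.items())
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy, made-up vectors standing in for learned representations.
    ontology = {
        "Attack": [0.9, 0.1, 0.0],
        "Transport": [0.1, 0.9, 0.1],
    }
    mention = [0.2, 0.1, 0.9]  # hypothetical vector for a "donate" event mention

    print(ground(mention, ontology))  # best match among the two seen types

    # Extend the ontology with an unseen type; no model update is needed.
    ontology["Donation"] = [0.1, 0.0, 0.95]
    print(ground(mention, ontology))
```

In this toy setup, adding the "Donation" entry changes the best match for the same mention vector, which is the behavior a closed-set classifier cannot offer without retraining.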
Incremental Prompting: Episodic Memory Prompt for Lifelong Event Detection
Lifelong event detection aims to incrementally update a model with new event
types and data while retaining the capability on previously learned old types.
One critical challenge is that the model would catastrophically forget old
types when continually trained on new data. In this paper, we introduce
Episodic Memory Prompts (EMP) to explicitly preserve the learned task-specific
knowledge. Our method adopts a continuous prompt for each task, optimized to
guide the model's predictions and to learn event-specific
representations. The EMPs learned in previous tasks are carried along with the
model in subsequent tasks, and can serve as a memory module that retains old
knowledge and transfers it to new tasks. Experimental results demonstrate the
effectiveness of our method. Furthermore, we also conduct a comprehensive
analysis of the new and old event types in lifelong learning.
Comment: Accepted to COLING'22 Main Conference (Short paper). 9 pages, 2
figures, 3 tables
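The "carried along" behavior of the Episodic Memory Prompts described in this abstract can be pictured with a small bookkeeping sketch. This is an illustration under assumptions, not the authors' implementation: prompts are plain lists of floats rather than trainable parameters, prepending them to the token embeddings is an assumed placement, and the class name `EpisodicPromptMemory` and its methods are hypothetical.

```python
class EpisodicPromptMemory:
    """Keeps one block of prompt vectors per task and never drops earlier blocks."""

    def __init__(self, prompt_length, dim):
        self.prompt_length = prompt_length  # number of prompt vectors per task
        self.dim = dim                      # dimensionality of each vector
        self.prompts = {}                   # task_id -> list of prompt vectors

    def add_task(self, task_id):
        """Allocate a fresh prompt block for a new task; earlier blocks stay untouched."""
        if task_id in self.prompts:
            raise ValueError(f"task {task_id!r} already has a prompt block")
        block = [[0.0] * self.dim for _ in range(self.prompt_length)]
        self.prompts[task_id] = block
        return block  # in a real system these would be trainable parameters

    def build_input(self, token_embeddings):
        """Prepend every stored prompt block (oldest first) to the token embeddings."""
        carried = [vec for block in self.prompts.values() for vec in block]
        return carried + list(token_embeddings)


if __name__ == "__main__":
    memory = EpisodicPromptMemory(prompt_length=2, dim=3)
    tokens = [[1.0, 1.0, 1.0]]  # one made-up token embedding

    memory.add_task("task_1")
    print(len(memory.build_input(tokens)))  # 2 prompt vectors + 1 token

    memory.add_task("task_2")
    print(len(memory.build_input(tokens)))  # 4 prompt vectors + 1 token
```

The point of the sketch is only that the prompt blocks of earlier tasks remain part of the model input when later tasks arrive; how the prompts are optimized is outside its scope.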
A Survey of Document-Level Information Extraction
Document-level information extraction (IE) is a crucial task in natural
language processing (NLP). This paper conducts a systematic review of recent
document-level IE literature. In addition, we conduct a thorough error analysis
with current state-of-the-art algorithms and identify their limitations as well
as the remaining challenges for the task of document-level IE. According to our
findings, labeling noise, entity coreference resolution, and a lack of
reasoning severely affect the performance of document-level IE. The objective
of this survey paper is to provide more insights and help NLP researchers
further enhance document-level IE performance.
PaperRobot: Incremental Draft Generation of Scientific Ideas
We present PaperRobot, which acts as an automatic research assistant by
(1) conducting deep understanding of a large collection of human-written papers
in a target domain and constructing comprehensive background knowledge graphs
(KGs); (2) creating new ideas by predicting links from the background KGs, by
combining graph attention and contextual text attention; (3) incrementally
writing some key elements of a new paper based on memory-attention networks:
from the input title along with predicted related entities to generate a paper
abstract, from the abstract to generate conclusion and future work, and finally
from future work to generate a title for a follow-on paper. Turing Tests, where
a biomedical domain expert is asked to compare a system output and a
human-authored string, show that PaperRobot-generated abstracts, conclusion and
future work sections, and new titles are chosen over human-written ones up to
30%, 24% and 12% of the time, respectively.
Comment: 12 pages. Accepted by ACL 2019. Code and resources are available at
https://github.com/EagleW/PaperRobo